Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

process of mapping reads to a reference genome, producing a SAM/BAM file that contains
the mapping information. Refer to Chapter 2 for read mapping and the content of
SAM/BAM files. When dealing with RNA-Seq data, we can align the reads either to a
reference genome or to a reference transcriptome. When we align RNA-Seq reads to a
eukaryotic reference genome, we must use an aligner such as STAR that is able to detect
splice junctions. The reads in this case will map to the exons, leaving introns and other
non-coding regions of the genome uncovered. On the other hand, when aligning RNA-Seq
reads to a reference transcriptome, the aligned reads may cover the entire sequence. This
strategy is preferable when reads are very short (less than 50 bases). The downside of
aligning reads to a transcriptome is that we may miss some novel genes, since the
transcriptome is made up of only known transcripts. As discussed in Chapter 2, there are
several aligners; however, for RNA-Seq, we prefer a splice-aware aligner that is able to
introduce long gaps to span introns when aligning reads to a reference genome. The
commonly used aligners for RNA-Seq data include STAR [5], segemehl [6], GEM [7], BWA [8],
BWA-MEM [8], and BBMap [9].
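
To make the idea of "long gaps that span introns" concrete, the following minimal Python sketch shows how such gaps appear in a SAM record. In the SAM format, the CIGAR operation N marks a region of the reference that is skipped by the read, which is how a splice-aware aligner reports an intron. The function name spliced_blocks and the example coordinates are our own illustration and are not part of any aligner.

```python
import re

# Matches one CIGAR operation, e.g. "50M" or "1200N".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def spliced_blocks(pos, cigar):
    """Split one alignment into aligned blocks and skipped regions.

    `pos` is the 1-based leftmost reference position (SAM POS field) and
    `cigar` is the SAM CIGAR string. Returns (blocks, introns), two lists
    of 1-based inclusive (start, end) intervals on the reference.
    """
    blocks, introns = [], []
    ref = pos           # next reference position to be consumed
    block_start = None  # start of the currently open aligned block
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            # These operations consume reference bases inside a block.
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":
            # A skipped region closes the current block and records a gap.
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            introns.append((ref, ref + length - 1))
            ref += length
        # I, S, H and P do not consume reference bases, so they are ignored.
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks, introns


# Hypothetical read: 30 bases on one exon, a 500-base intron, 20 bases on the next exon.
print(spliced_blocks(100, "30M500N20M"))
# -> ([(100, 129), (630, 649)], [(130, 629)])
```

An aligner that is not splice-aware has no way to report the 500-base gap above and would typically soft-clip or fail to align the part of the read that lies on the second exon.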

Before deciding which of the aligners to use with RNA-Seq reads, make sure that the
aligner is splice-aware and able to distinguish between reads aligned across exon–intron
boundaries and reads with short insertions [10]. The splice-aware aligners include STAR
[5], GSNAP [11], MapSplice [12], RUM [13], and HISAT2 [14]. Each of these aligners has
different advantages and disadvantages in terms of memory efficiency, performance, and
speed. Refer to the user guide of any of these aligners to learn more about them. We will
use STAR (Spliced Transcripts Alignment to a Reference) as an example aligner for
aligning RNA-Seq data. Several studies found that STAR is one of the most accurate aligners
of RNA-Seq reads [15]. However, STAR requires a large amount of memory for indexing and
mapping. The reference sequence must be indexed by STAR before alignment. STAR begins the
mapping process with a seed search: for each read, it finds the longest portion, starting
from the beginning of the read, that exactly matches one or more locations on the
reference sequence. For partially aligned reads, STAR then repeats the search on the
unmapped remainder, which may align to a different region. The parts of a read that align
to different locations of the reference sequence are called seeds. If STAR does not find
an exact match for part of a read, the existing seeds are extended, allowing mismatches
and gaps. If the extension does not give a good alignment, the poorly aligning portion of
the read is soft-clipped. In the second step of the STAR alignment process, multiple seeds
are clustered based on proximity to a set of anchor seeds. The clustered seeds are then
stitched together based on the best alignment score [5].
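
The seed search described above can be illustrated with a toy Python sketch. It is not STAR's implementation: STAR searches an index built from the reference, whereas this sketch uses a naive string search, allows no mismatches, and skips the clustering and stitching step. The function names and the short sequences are hypothetical and chosen only so that the read spans one exon junction.

```python
def find_all(pattern, text):
    """Return every 0-based position at which `pattern` occurs in `text`."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def longest_exact_prefix(read, reference):
    """Find the longest prefix of `read` that occurs exactly in `reference`.

    Returns (length, positions); length is 0 if even the first base is absent.
    """
    best_len, best_pos = 0, []
    for length in range(1, len(read) + 1):
        positions = find_all(read[:length], reference)
        if not positions:
            break
        best_len, best_pos = length, positions
    return best_len, best_pos


def seed_search(read, reference):
    """Split `read` into consecutive exact-match seeds.

    Returns a list of (read_offset, seed_length, reference_positions).
    """
    seeds = []
    offset = 0
    while offset < len(read):
        length, positions = longest_exact_prefix(read[offset:], reference)
        if length == 0:
            offset += 1  # base not present in the reference; skip it
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


# Toy reference: exon 1, a 10-base "intron" (GT...AG), exon 2.
reference = "ACGGATCA" + "GTTTTTTTAG" + "CCTGAGTC"
# Toy read covering the end of exon 1 and the start of exon 2.
read = "GGATCA" + "CCTGAG"

print(seed_search(read, reference))
# -> [(0, 6, [2]), (6, 6, [18])]
```

The first seed ends at reference position 7 and the second starts at position 18 (0-based), so the 10 bases between them correspond to the intron that a splice-aware aligner has to bridge when it stitches the seeds together.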

When reads are mapped to a reference sequence, the percentage of mapped reads reflects
the quality of the alignment. A low percentage may indicate contamination of the sample.
Read coverage and depth on exons are other factors that determine alignment quality.
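
As a rough illustration of how the percentage of mapped reads can be obtained, the sketch below counts primary alignment records in SAM-formatted text using the FLAG field: bit 0x4 marks an unmapped record, and bits 0x100 and 0x800 mark secondary and supplementary records, which are skipped so that a read is not counted more than once. The function name mapping_rate and the example records are hypothetical. For paired-end data each mate is a separate record, so the result is a per-record rate.

```python
def mapping_rate(sam_lines):
    """Return the percentage of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary records
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Four hypothetical records: two mapped, one unmapped, one secondary.
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t200\t255\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r2\t256\tchr2\t300\t0\t50M\t*\t0\t0\t*\t*",
]
print(round(mapping_rate(records), 1))
# -> 66.7
```

In practice this number does not need to be computed by hand: STAR writes a summary file (Log.final.out) that reports the percentage of uniquely mapped reads, among other statistics.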

Above, we downloaded the FASTQ files into the directory “fastq”. We can map the reads
in these FASTQ files to a reference genome using the STAR program. STAR is a short-read
aligner designed to align RNA-Seq reads to a reference sequence (genome or transcriptome).
To align the reads in the FASTQ files using STAR, we need to download a reference sequence
together with its annotation file in GTF format. Since our example FASTQ files are from
human samples, we need to download the latest human genome and its annotation file.
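
Once the reference FASTA and GTF files are available, the two STAR steps (indexing, then mapping) can be sketched as follows. This Python snippet only assembles the command lines and does not run STAR. The options shown are standard STAR options, but the file names, directory names, read length, and thread count are placeholders that we assume for illustration and must be adapted to the actual data.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Build the STAR command that indexes a reference genome."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The STAR manual suggests read length minus 1 for this option.
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", str(threads),
    ]


def star_align_command(genome_dir, fastq_files, out_prefix, threads=4):
    """Build the STAR command that maps one sample (one or two FASTQ files)."""
    cmd = [
        "STAR", "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]
    if all(name.endswith(".gz") for name in fastq_files):
        # Compressed input must be decompressed on the fly.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


# Placeholder paths (assumptions, not files provided by the book).
index_cmd = star_index_command("ref/star_index", "ref/genome.fa", "ref/annotation.gtf")
align_cmd = star_align_command(
    "ref/star_index",
    ["fastq/sample_1.fastq.gz", "fastq/sample_2.fastq.gz"],
    "star_out/sample_",
)
print(shlex.join(index_cmd))
print(shlex.join(align_cmd))
```

The printed lines can be pasted into a terminal, or the lists can be passed to subprocess.run. Indexing needs to be done only once per reference, while the alignment command is repeated for every sample.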